g03aaf
© Numerical Algorithms Group, 2002.
Purpose
G03AAF Performs principal component analysis
Synopsis
[e,p,v,s,ifail] = g03aaf(x<,matrix,isx,s,wt,std,weight,ifail>)
Description
Let $X$ be an $n$ by $p$ data matrix of $n$ observations on $p$ variables
$x_1,x_2,\ldots,x_p$ and let the $p$ by $p$ variance-covariance matrix of
$x_1,x_2,\ldots,x_p$ be $S$. A vector $a_1$ of length $p$ is found such that:
$a_1^{\mathrm{T}} S a_1$ is maximized subject to $a_1^{\mathrm{T}} a_1 = 1.0$.
The variable $z_1 = \sum_{i=1}^{p} a_{1i} x_i$ is known as the first principal
component and gives the linear combination of the variables that
gives the maximum variation. A second principal component,
$z_2 = \sum_{i=1}^{p} a_{2i} x_i$, is found such that:
$a_2^{\mathrm{T}} S a_2$ is maximized subject to $a_2^{\mathrm{T}} a_2 = 1.0$ and $a_2^{\mathrm{T}} a_1 = 0.0$.
This gives the linear combination of the variables, orthogonal to the
first principal component, that has the maximum variation. Further
principal components are derived in a similar way. The elements of the
vectors $a_i$ are known as the principal component loadings.
The vectors $a_1,a_2,\ldots,a_p$ are the eigenvectors of the matrix $S$,
and associated with each eigenvector is an eigenvalue $\lambda_i^2$.
The value of $\lambda_i^2 / \sum_i \lambda_i^2$ gives the proportion of
variation explained by the $i$th principal component. Alternatively,
the $a_i$ can be considered as the right singular vectors in a
singular value decomposition, with singular values $\lambda_i$, of
the data matrix centred about its mean and scaled by $1/\sqrt{n-1}$.
This latter approach is used in G03AAF.
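As a concrete illustration of this SVD-based approach, the following
minimal MATLAB sketch reproduces the quantities described above using
the built-in svd function rather than the NAG routine; the data and all
variable names are illustrative assumptions, not part of the G03AAF
interface:

    % Minimal sketch of the SVD approach described above (illustrative,
    % not the NAG implementation).
    x = rand(10, 3);                      % hypothetical n by p data matrix
    [n, p] = size(x);
    xc = x - repmat(mean(x), n, 1);       % centre about the mean
    [u, d, v] = svd(xc / sqrt(n - 1), 0); % economy-size SVD
    lambda = diag(d);                     % singular values lambda_i
    prop = lambda.^2 / sum(lambda.^2);    % proportion of variation explained
    scores = xc * v;                      % scores with variance = eigenvalue
    zstd = sqrt(n - 1) * u;               % scores standardised to variance 1.0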
Principal component analysis is often used to reduce the
dimension of a data set, replacing a large number of correlated
variables with a smaller number of orthogonal variables that
still contain most of the information in the original data set.
The choice of the number of dimensions required is usually based
on the amount of variation accounted for by the leading principal
components. If k principal components are selected then a test of
the equality of the remaining p-k eigenvalues is
$$\left(n - \frac{2p+5}{6}\right)\left\{-\sum_{i=k+1}^{p} \log(\lambda_i^2) + (p-k)\log\left(\sum_{i=k+1}^{p} \lambda_i^2 \Big/ (p-k)\right)\right\}$$
which has, asymptotically, a $\chi^2$ distribution with
$\frac{1}{2}(p-k-1)(p-k+2)$ degrees of freedom.
Equality of the remaining eigenvalues indicates that if any more
principal components are to be considered then they all should be
considered.
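A small MATLAB sketch of this test, written as a stand-alone function
(illustrative only; eigtest, lambda2 and the other names are assumptions,
not part of the NAG interface):

    function [stat, df] = eigtest(lambda2, k, n)
    % Illustrative sketch of the test of equality of the remaining
    % p-k eigenvalues; lambda2 holds the p eigenvalues lambda_i^2.
    p = numel(lambda2);
    rest = lambda2(k+1:p);                  % the remaining p-k eigenvalues
    stat = (n - (2*p + 5)/6) * ...
           (-sum(log(rest)) + (p - k)*log(sum(rest)/(p - k)));
    df = (p - k - 1)*(p - k + 2)/2;         % chi-squared degrees of freedom
    end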
Instead of the variance-covariance matrix, the correlation matrix,
the sums of squares and cross-products matrix or a standardised
sums of squares and cross-products matrix may be used. In the
last case $S$ is replaced by $\sigma^{-1/2} S \sigma^{-1/2}$ for a
diagonal matrix $\sigma$ with positive elements. If the
correlation matrix is used, the $\chi^2$ approximation for the
statistic given above is not valid.
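By way of comparison, a correlation-based analysis can be sketched in
plain MATLAB as an eigendecomposition of the correlation matrix
(illustrative only; in G03AAF the choice of matrix is made through the
optional matrix argument listed below):

    % Correlation-based principal component analysis (illustrative).
    x = rand(10, 3);                          % hypothetical data
    r = corrcoef(x);                          % correlation matrix in place of S
    [v, d] = eig(r);
    [evals, idx] = sort(diag(d), 'descend');  % eigenvalues, largest first
    loadings = v(:, idx);                     % loadings ordered to match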
The principal component scores are the values of the principal
component variables for the observations. These can be standardised
so that the variance of the scores for each principal component is
either 1.0 or equal to the corresponding eigenvalue. The principal
component scores correspond to the left-hand singular vectors.
Weights can be used with the analysis, in which case the matrix $X$
is first centred about the weighted means and each row is then scaled
by an amount $\sqrt{w_i}$, where $w_i$ is the weight for the $i$th
observation.
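A sketch of this weighted centring and scaling, again in plain MATLAB
(illustrative; the variable wt mirrors the optional argument of the same
name below, but the code itself is an assumption, not the NAG
implementation):

    % Sketch of the weighted centring and scaling described above
    % (illustrative only).
    x = rand(10, 3);  wt = rand(10, 1);       % hypothetical data and weights
    [n, p] = size(x);
    wmean = sum(repmat(wt, 1, p) .* x) / sum(wt);  % weighted column means
    xw = repmat(sqrt(wt), 1, p) .* ...
         (x - repmat(wmean, n, 1));           % row i scaled by sqrt(w_i)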
Parameters
Required Input Arguments:
x (:,:) real
Optional Input Arguments: <Default>
matrix (1) string 'v'
isx (:) integer ones(size(x,2),1)
s (:) real zeros(size(x,2),1)
wt (:) real zeros(size(x,1),1)
std (1) string 'u'
weight (1) string 'u'
ifail integer -1
Output Arguments:
e (:,6) real
p (:,:) real
v (:,:) real
s (:) real
ifail integer
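A minimal example call, relying on the defaults listed above (the data
are hypothetical and the NAG Toolbox for MATLAB must be available; see
the full parameter descriptions for the layout of each output):

    % Hypothetical example: PCA of the variance-covariance matrix with
    % unstandardised scores, using all default optional arguments.
    x = rand(10, 3);                  % 10 observations on 3 variables
    [e, p, v, s, ifail] = g03aaf(x);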